Credit Card User Churn Prediction Project

by Michele Casalgrandi

Background

Thera bank's credit cards are a good source of revenue from the various fees the bank charges (annual fees, balance transfer fees, late payment fees, interest charges, etc.).

However, the bank has seen a steep decline in the number of credit card users, with a resulting loss of revenue.

The bank wants to understand why customers are dropping their credit cards and to have a predictive model that identifies which customers are likely to do so.

Objectives

Data Dictionary

Exploratory Data Analysis (EDA)

Import libraries and load data set

There are missing values in Education_Level and Marital_Status

There is no clear pattern for observations with missing Marital_Status

There are no duplicates.

Column CLIENTNUM is unique for each row. We will drop it, as it adds no value to the analysis or the models.
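The checks above can be sketched as follows. This is a minimal illustration on a made-up four-row frame, not the bank data; only the column names come from the notebook, and the values are assumptions.

```python
import numpy as np
import pandas as pd

# Tiny synthetic stand-in for the bank data set (all values are made up).
df = pd.DataFrame({
    "CLIENTNUM": [1001, 1002, 1003, 1004],
    "Customer_Age": [45, 52, 38, 61],
    "Education_Level": ["Graduate", np.nan, "High School", "Graduate"],
    "Marital_Status": ["Married", "Single", np.nan, "Married"],
})

# Count missing values per column.
missing_counts = df.isnull().sum()

# Count fully duplicated rows.
n_duplicates = int(df.duplicated().sum())

# An identifier with one distinct value per row carries no signal, so drop it.
id_is_unique = df["CLIENTNUM"].nunique() == len(df)
if id_is_unique:
    df = df.drop(columns=["CLIENTNUM"])

print(missing_counts)
print("duplicates:", n_duplicates)
```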

Numerical Variables

Numerical Variables Observations

Categorical Variables

Categorical Variables Observations

Univariate EDA

Customer_Age

Customer_Age is normally distributed with a few outliers to the right.

Dependent_count

Dependent_count is roughly normally distributed

Months_on_book

Months_on_book is normally distributed with the exception of a large peak at 36 (three years) of 2436 customers. Only a few outliers.

Total_Relationship_Count

Most customers have three products closely followed by 4, 5, and 6 with similar counts.

Months_Inactive_12_mon

Surprisingly most customers have been inactive between one and three months with low counts for zero, four, five and six.

Contacts_Count_12_mon

Roughly normally distributed with a right tail (six contacts)

Credit_Limit

Data is highly skewed to the right with two peaks at 1438 (507 obs) and 34516 (508 obs), possibly the minimum and maximum credit limits allowed by the bank.

We will log transform the variable. Because the log transform is applied value by value and uses no statistic estimated from the data, we can do this prior to splitting the data between train and test.
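A minimal sketch of the log transform, assuming a handful of made-up credit limits (only the two peak values 1438 and 34516 come from the notebook). The name Credit_Limit_log matches the observation that follows.

```python
import numpy as np
import pandas as pd

# Made-up credit limits with a right skew; not the real column.
credit_limit = pd.Series(
    [1438.0, 1438.0, 2500.0, 4000.0, 9000.0, 34516.0], name="Credit_Limit"
)

# np.log works element-wise and learns nothing from the data,
# so applying it before the train/test split cannot leak information.
credit_limit_log = np.log(credit_limit).rename("Credit_Limit_log")

print("skew before:", credit_limit.skew())
print("skew after: ", credit_limit_log.skew())
```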

Credit_Limit_log distribution is improved over the original Credit_Limit

Total_Revolving_Bal

Many customers (2470) have a revolving balance of zero. Also many (508) have a revolving balance of 2517 (the maximum value in the data set).

Aside from the peaks at the minimum and maximum, the rest of the data is normally distributed with a slight left skew.

Avg_Open_To_Buy

The data is highly skewed to the right.

A log transform would create a left skew in the data. We will instead use a square root transform to reduce the skew.

As the square root transform is independent of the data distribution, we can transform prior to splitting between train and test.

Although the right skew is not eliminated, it is not as severe as in the original feature.
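The choice between log and square root can be checked by comparing skewness. The sketch below uses made-up values with one very small entry, which is the kind of value that lets a log transform overshoot into a left skew; the real column may behave differently.

```python
import numpy as np
import pandas as pd

# Made-up Avg_Open_To_Buy values; the real column is not reproduced here.
avg_open_to_buy = pd.Series(
    [3.0, 450.0, 1200.0, 3400.0, 9800.0, 34000.0], name="Avg_Open_To_Buy"
)

skew_raw = avg_open_to_buy.skew()            # strong right skew
skew_log = np.log(avg_open_to_buy).skew()    # overshoots into a left skew here
skew_sqrt = np.sqrt(avg_open_to_buy).skew()  # milder right skew

print(f"raw: {skew_raw:.2f}  log: {skew_log:.2f}  sqrt: {skew_sqrt:.2f}")
```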

Total_Amt_Chng_Q4_Q1

Data is heavily right skewed.

There are some large outliers

We will transform with square root.

Skew has been removed although there are long tails with some distant outliers

Total_Trans_Amt

Total_Trans_Amt has a multimodal distribution with a right skew.

Total_Trans_Ct
Total_Ct_Chng_Q4_Q1

There is a right skew with a long right tail.

We will transform using square root.

The distribution is now more centered although there still are long tails with outliers.

Avg_Utilization_Ratio

There are 2470 customers with zero utilization (none of the available credit is used)

Categorical variables univariate analysis
Gender

53% of customers are Female and 47% Male

Education_Level

The most common education level is 'Graduate' (31%), followed by 'High School' (20%).

Marital_Status
Income_Category

The most common income category is 'Less than $40K' (35%), followed by '40-60K' (18%).

11% of customers have an invalid entry ('abc'). We will treat those values as missing and impute them.

Card_Category

The vast majority of customers have a 'Blue' credit card (93%) followed by 'Silver' at 5.5%

Attrition_Flag

16.1% of customers dropped their credit card.

Bivariate Analysis

Pairplots and correlation observations

"Total_Trans_Amt" vs "Attrition_Flag"

Attrited customers tend to have a lower Total_Trans_Amt.

"Avg_Open_To_Buy" vs "Attrition_Flag"

Distributions look similar with minor differences in the IQRs.

"Avg_Utilization_Ratio" vs "Attrition_Flag"

Attrited customers have significantly lower Avg_Utilization_Ratio. There is some overlap with the outliers of attrited customers.

"Total_Ct_Chng_Q4_Q1" vs "Attrition_Flag"

The median and IQR for attrited customers are lower, suggesting that customers with declining card usage are more likely to cancel the card.

"Total_Amt_Chng_Q4_Q1" vs "Attrition_Flag"

Attrited customers tend to have a lower Total_Amt_Chng_Q4_Q1 ratio.

"Total_Revolving_Bal" vs "Attrition_Flag"

Attrited customers tend to have a lower revolving balance. However, the range of values is the same as for existing customers.

"Credit_Limit" vs "Attrition_Flag"

There are only slight differences between distributions.

"Contacts_Count_12_mon" vs "Attrition_Flag"

Existing customers are mostly grouped between 1 and 4 contacts, while attrited customers are more distributed over the entire range (0 to 6).

"Months_Inactive_12_mon" vs "Attrition_Flag"

The distribution for attrited customers is more concentrated between 1 and 4 months of inactivity. This suggests customers stop using the card for a month or more before they drop it.

"Total_Relationship_Count" vs "Attrition_Flag"

Attrited customers tend to have or use fewer products than existing customers.

"Months_on_book" vs "Attrition_Flag"

There are only minor differences in Months_on_book distributions between attrited and existing customers.

"Dependent_count" vs "Attrition_Flag"

Attrited customers are more concentrated between 1 and 4 dependents

"Customer_Age" vs "Attrition_Flag"

Age does not seem to influence significantly whether a customer is attrited or not.

"Card_Category" vs "Attrition_Flag"

Holders of Platinum cards seem more prone to cancel the credit card. However, there is a very small number of customers with Platinum cards.

"Gender" vs "Attrition_Flag"

Females are slightly more likely to drop the credit card.

"Education_Level" vs "Attrition_Flag"

Customers with an education of 'Doctorate' are more likely to drop the credit card.

"Marital_Status" vs "Attrition_Flag"

There are only slight differences between attrition rates according to marital status.

"Income_Category" vs "Attrition_Flag"

Customers at the extremes of the income range ('Less than $40K' and '$120K') tend to drop the card at slightly higher rates.
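Attrition rates per category, as compared in the observations above, can be computed with a row-normalized cross-tabulation. This is a sketch on eight made-up rows; the label strings 'Attrited Customer' and 'Existing Customer' are assumptions about how Attrition_Flag is coded.

```python
import pandas as pd

# Made-up rows; label strings are assumed, not taken from the notebook.
df = pd.DataFrame({
    "Gender": ["F", "F", "F", "F", "M", "M", "M", "M"],
    "Attrition_Flag": [
        "Attrited Customer", "Existing Customer",
        "Attrited Customer", "Existing Customer",
        "Existing Customer", "Existing Customer",
        "Attrited Customer", "Existing Customer",
    ],
})

# normalize="index" turns each row into proportions that sum to 1,
# so the 'Attrited Customer' column is the attrition rate per group.
rates = pd.crosstab(df["Gender"], df["Attrition_Flag"], normalize="index")
print(rates)
```

The same call works for any categorical column by swapping "Gender" for, say, "Card_Category" or "Income_Category".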

Data Preparation for Modeling

Income_Category invalid values

First we set the 'abc' values to np.nan.

As this replacement doesn't depend on the data distribution, we can do it before splitting the data.
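A minimal sketch of the replacement on a made-up series (the category strings other than 'abc' are only illustrative):

```python
import numpy as np
import pandas as pd

# Made-up Income_Category values containing the invalid 'abc' entry.
income = pd.Series(
    ["Less than $40K", "abc", "$120K", "abc"], name="Income_Category"
)

# Replace the invalid marker with NaN so it is handled like any missing value.
income_clean = income.replace("abc", np.nan)

print(income_clean.isna().sum(), "values marked as missing")
```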

Split data

Missing values treatment

We will use an imputer to replace the null values with the most frequent occurrence of the category, i.e. the mode.

Model building
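One way to do this is scikit-learn's SimpleImputer with strategy="most_frequent", fit on the training split only and then applied to the other splits. The sketch below assumes that approach and uses made-up rows; the split proportions and random_state are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up categorical features with missing values.
X = pd.DataFrame({
    "Education_Level": ["Graduate", "Graduate", np.nan, "High School",
                        "Graduate", np.nan, "High School", "Graduate"],
    "Marital_Status": ["Married", np.nan, "Single", "Married",
                       "Married", "Single", np.nan, "Married"],
})
y = pd.Series([0, 1, 0, 0, 1, 0, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

# Learn the mode of each column from the training data only,
# then reuse those modes on the test data to avoid leakage.
imputer = SimpleImputer(strategy="most_frequent")
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_imp = pd.DataFrame(
    imputer.transform(X_test), columns=X_test.columns, index=X_test.index
)
```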

Model performance metrics

We consider False Negatives (customers who drop the card but are predicted to stay) to have the higher impact, and therefore we will optimize Recall.

Utility functions
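Recall is the share of actual positives that the model catches, TP / (TP + FN). The sketch below computes it by hand and with scikit-learn on made-up labels, assuming attrited customers are encoded as 1.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Assumed encoding: 1 = attrited, 0 = existing. Labels are made up.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

# One attrited customer was missed (a false negative), which lowers recall.
recall_manual = tp / (tp + fn)
recall_sklearn = recall_score(y_true, y_pred)

print(recall_manual, recall_sklearn)
```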

Define six models and train them on the training data
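A sketch of how several models can be compared on cross-validated recall. It shows three scikit-learn classifiers on synthetic data rather than the notebook's six models; the random forest is an assumption, and XGBoost is left out only to keep the example dependent on scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in for the training data (about 16% positives).
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.84, 0.16], random_state=1
)

models = {
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=1),
    "Gradient boost": GradientBoostingClassifier(random_state=1),
    "Adaboost": AdaBoostClassifier(random_state=1),
}

# Stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
cv_recall = {
    name: cross_val_score(model, X, y, scoring="recall", cv=cv).mean()
    for name, model in models.items()
}

for name, score in cv_recall.items():
    print(f"{name}: mean CV recall = {score:.3f}")
```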

Hyperparameter tuning

We will tune only the top three models: XGBoost, Gradient Boost, Adaboost

Adaboost

Performance of the tuned Adaboost for validation is similar to the model with default parameters.
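Tuning for recall can be sketched with RandomizedSearchCV. The search space, the number of iterations, and the synthetic data below are assumptions for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the training data.
X, y = make_classification(
    n_samples=300, n_features=8, weights=[0.84, 0.16], random_state=1
)

# Hypothetical search space; the notebook's actual grid is not shown.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "learning_rate": [0.05, 0.1, 0.5, 1.0],
}

search = RandomizedSearchCV(
    AdaBoostClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=5,          # sample 5 of the 12 possible combinations
    scoring="recall",  # optimize the metric chosen above
    cv=3,
    random_state=1,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```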

Gradient boost tuning

After tuning, recall is 0.877 and accuracy is 0.969.

XGBoost tuning

Recall is slightly higher after tuning (0.899 tuned vs 0.880 with CV).

Accuracy is high at 0.973.

Models comparison

The best recall is from the XGBoost tuned model with 0.899 for validation. For that model accuracy is high as well with 0.973

Data oversampling

We will now oversample the data to see whether balancing the target class improves model performance.
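This section does not name the oversampling method, so the sketch below uses plain random oversampling with sklearn.utils.resample as a stand-in: minority rows in the training set are drawn with replacement until both classes have the same count. The data and the column name "target" are made up.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up training set: 3 attrited (1) vs 7 existing (0) customers.
train = pd.DataFrame({
    "Total_Trans_Ct": [40, 65, 72, 30, 80, 58, 25, 90, 61, 77],
    "target": [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
})

minority = train[train["target"] == 1]
majority = train[train["target"] == 0]

# Draw minority rows with replacement until the classes are balanced.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=1
)

# Recombine and shuffle. Only the training data is resampled,
# so validation and test scores still reflect the real class ratio.
train_over = pd.concat([majority, minority_upsampled]).sample(
    frac=1, random_state=1
)
print(train_over["target"].value_counts())
```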

Fit tuned and default models to oversampled data

Train models on undersampled data

Undersample the data
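A minimal sketch of random undersampling, assuming the majority class is sampled without replacement down to the size of the minority class. The method and the data are assumptions; the notebook's own undersampler is not shown in this section.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up training set: 3 attrited (1) vs 7 existing (0) customers.
train = pd.DataFrame({
    "Total_Trans_Ct": [40, 65, 72, 30, 80, 58, 25, 90, 61, 77],
    "target": [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
})

minority = train[train["target"] == 1]
majority = train[train["target"] == 0]

# Keep a random subset of majority rows, as many as there are minority rows.
majority_down = resample(
    majority, replace=False, n_samples=len(minority), random_state=1
)

train_under = pd.concat([majority_down, minority]).sample(frac=1, random_state=1)
print(train_under["target"].value_counts())
```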

Compare scores of models fit on undersampled data

We will now assess the scores on test data set

Top three models performance on test data

Adaboost tuned: Recall on test data is 0.975, Accuracy is 0.941

XGBoost Default: Recall on test data is 0.966, Accuracy is 0.944

Default gradient boost: Recall 0.954, Accuracy 0.939

All three top models meet the requirements of Recall > 0.95 and Accuracy > 0.70.

Feature importance

We will use the best model (Adaboost) to assess feature importance.
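Fitted scikit-learn ensembles expose a feature_importances_ attribute that can be ranked. The sketch below fits AdaBoost on synthetic data; the column names are borrowed from the data set only as labels, so the resulting ranking says nothing about the real features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

# Labels borrowed from the data set; the values are synthetic.
feature_names = [
    "Total_Trans_Ct",
    "Total_Trans_Amt",
    "Total_Revolving_Bal",
    "Total_Ct_Chng_Q4_Q1",
]
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=1
)

model = AdaBoostClassifier(random_state=1).fit(X, y)

# Pair each importance with its feature name and sort from high to low.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```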

Feature importance Observations

Build the model using a pipeline

We will use the tuned Adaboost model in the pipeline, as it has the best performance on the test set.

As 'Credit_Limit' needs to be dropped, there are no columns left to transform with log.

Create transformers to preprocess the data
Create pipeline
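A sketch of how the preprocessing steps and the model could be chained, assuming a ColumnTransformer with a square root step for the skewed numeric column and an impute-then-encode step for categoricals. The column selection, the one-hot encoding, the default Adaboost parameters, and the eight made-up rows are all assumptions; the tuned parameters are not reproduced here.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder

# Made-up rows using a few of the data set's column names.
X = pd.DataFrame({
    "Avg_Open_To_Buy": [100.0, 2500.0, 900.0, 12000.0, 400.0, 30000.0, 50.0, 7000.0],
    "Total_Trans_Amt": [1200, 4500, 800, 5200, 1500, 4100, 900, 4800],
    "Education_Level": ["Graduate", np.nan, "High School", "Graduate",
                        "Doctorate", "Graduate", np.nan, "High School"],
})
y = pd.Series([1, 0, 1, 0, 1, 0, 1, 0])

# Categorical columns: fill missing values with the mode, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [
        ("sqrt", FunctionTransformer(np.sqrt), ["Avg_Open_To_Buy"]),
        ("cat", categorical, ["Education_Level"]),
    ],
    remainder="passthrough",  # leave the other numeric columns unchanged
)

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", AdaBoostClassifier(random_state=1)),
])

pipeline.fit(X, y)
predictions = pipeline.predict(X)
```

Keeping the imputer inside the pipeline means its statistics are learned only from whatever data is passed to fit, which matches the split-then-impute order used earlier.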

Pipeline performance on the test set created for the pipeline

Pipeline Performance Observations

Summary and Business Recommendations

Observations from data analysis and feature importances

Recommendations

Customers at risk should be targeted with incentives to increase purchases with the credit card.

Customers at risk should be targeted with incentives to raise the average revolving balance.